DIGEST


This investigation was initiated in order to determine the reason
for an apparent shortage of I's, J's, K's, and L's in the related term references of
the Engineers Joint Council (EJC) thesaurus. We decided to expand it into a
study of the statistical characteristics of first letters of related terms in
four thesauri.

In  the  course  of  the  investigation,  we  found  that,  in  general,  the 
distribution  of  descriptors  in  four  thesauri  followed  the  pattern  previously 
observed  by  Ohlman. 

The  initial  letter  frequency  for  related  terms  within  given  letters 
did  not  follow  this  pattern.  The  major  reason  seemed  to  be  letter  within 
letter redundancy. Related-term first letters that repeat the first letter of their
own descriptors occur over four times as often as expected when compared against
Ohlman and the descriptors themselves.

For  the  EJC  thesaurus,  we  found  a  wide  difference  between  the 
frequencies  of  the  first  letters  of  the  related  terms  within  given  sections  and 
the  frequencies  of  the  descriptors.  The  EJC  thesaurus  was  not  unusual  in 
this  characteristic  when  compared  to  the  other  thesauri.  It  was,  however, 
more  repetitious  of  the  same  first  letters  than  the  others. 

We  began  work  on  a  study  of  the  extent  to  which  this  letter 
redundancy  is  a  result  of  word  redundancy.  We  found  that  word  repetition 
only  accounted  for  one-half  of  the  "excess"  of  first  letters.  There  are,  then, 
factors  additional  to  word  repetition  contributing  to  the  "excess." 

The EJC and the ASTIA thesauri were heaviest in word repetition.
The EJC thesaurus was heaviest in letter repetition and in the additional
factor contributing to the excess of letters within letters.

FIRST LETTER FREQUENCY OF RELATED TERM
REFERENCES  IN  FOUR  TECHNICAL  THESAURI 

I.  INTRODUCTION. 

Indexers and catalogers have always used "see also" and related
term references as suggestions for uncovering more appropriate terms and
as a framework for the terms they choose. These references have always been
used during input and retrieval in both manual and mechanized systems.

In  present  mechanized  systems  for  retrieval  of  scientific  and 
technical  information,  related  term  references  play  a  major  role.  They  are 
usually consulted prior to the search. They may be incorporated as a "table
look-up" or, more commonly, they may be incorporated into the structure of
an external thesaurus. Regardless of how they are incorporated, they are at
present a principal method by which the searcher can discover alternative
paths for retrieval of documents and information. The user, however, has
little opportunity for real-time reinterrogation as he revises his request
based on the previous answer. He has little opportunity for browsing or
other interaction.

Interest  has  grown  in  supplying  users  with  devices  that  not  only 
provide  them  with  access  to  documents  and,  hence,  information  contained  in 
retrieval  systems,  but  also  with  suggestions  for  reformulation  or  alternative 
paths during the interrogation process. Newer proposals concern such devices
as consoles that provide an opportunity for rapid reaccess.* Such devices would
tend to increase the significance of these references.

Despite  their  present  and  growing  significance,  it  is  safe  to  say 
that  "see  also"  and  related  term  references  have  been  rather  neglected  in  the 
documentation literature. As opportunities grow for dynamic interaction
through rapid reaccess, and for the reasons pointed out above, related term
references are expected to become increasingly significant.

For this reason, we have begun an inquiry into the characteristics
of these terms.

In this work, we are not attempting an evaluation of the effect of
incorporating related term references. This can perhaps be better done by a
Cranfield-type  test,  where  the  effects  of  incorporating  these  references  may  be 
measured.  In  this  paper,  we  are  interested  in  finding  out  the  characteristics  of 
related  term  references  and  the  differences  in  practice  with  regard  to  their 
incorporation.  We  trust  that  knowing  what  we  are  evaluating  and  what  we 
are determining will be of eventual use in analysis, interpretation, and
redesign. 

Our  decision  to  study  related  term  references  came  about  in  an 
interesting  way.  We  present  this  as  historical  background. 

II.  HISTORICAL  BACKGROUND. 

Two  years  ago  one  of  the  investigators  noticed  what  seemed  to  be 
an anomaly in the Engineers Joint Council (EJC) thesaurus. It seemed that
first letter series within the W section of the EJC thesaurus were "broken
off" within the related term references. Thus, for the word "weathering"
there was a rather continuous series from F to Z, but nothing for A to E; for
"weather radar" there was a rather continuous series from A to Q, but
nothing from R to Z, and so on for other cases. In order to test whether
continuous series were indeed "broken off," it was decided to tabulate continuous
missing series. Table 1 presents some of the results of this investigation.

The  table  seemed  to  reveal  that  the  "missing  series,"  if  one 
could be found, contains the letters I, J, K, and L. The EJC thesaurus
seemed deficient in series containing these letters. Perhaps this was
normal. Perhaps the letters I, J, K, and L are less used as first letters in
normal nontechnical discourse, in technical discourse, in technical indexes,
in subject word lists, and in similar tools. A spot check of first letter
frequency tables indicated that, while one would indeed expect a scant
representation of J's and K's, the I's and L's should be well represented. We had to
look elsewhere.

While searching for an answer, we decided to consider the
possibility that instead of there being a "deficiency" of certain letters there was
actually a surplus of others. If one or more letters had an extra-large
representation, the others would seem to have a small representation by
comparison.

It seemed that there might be a surplus of related term W's
within  the  W  descriptors.  To  confirm  this,  we  counted  the  155)  related 
references within the W section. We found that about 20% of the related terms
within the W descriptors began with the letter W.

We  now  had  the  problem  of  finding  a  basis  for  comparison.  We 
wanted to know what one would expect in a normal situation. What
concentration of W's would one expect as first letters of subject words in technical
listings?

TABLE 1

CONTINUOUS MISSING SERIES OF RELATED TERM REFERENCES
IN ENGINEERS JOINT COUNCIL THESAURUS

Descriptor                      Missing series

Waves                           GHIJKLM
Waxes                           GHIJKLMNO
Weapons                         HIJKL
Wear                            IJK
Wear test                       EFGHIJKLMN
Wind                            EFGHIJKLMN
Wind measurement                EFGHIJKLMNOPQRSTU
Wind tunnels                    IJKLMNOPQR
Wings                           DEFGHIJKLMNOPQR
Wire bar                        IJKLMNOPQ
Wire communication systems      EFGHIJKL
Wiring                          FGHIJKLMNOPQRSTUVWXYZ
Wood                            DFGHIJK
Wood products                   GHIJKL
Woodworking machinery           EFGHIJK
Wool                            GHIJK


In 1958, H. Ohlman reported on subject letter frequencies within
a variety of dictionaries and indexing tools. A portion of one of his lists
appears as table 2.

Ohlman's tables gave an expected average frequency of 2.3% for
the letter W as the first letter of subject words. Thus, there were about
nine times the expected frequency of W's within the W's in EJC. We checked
into this further by counting the number of I's within the I section. It was
found that whereas, according to Ohlman, one would expect an average
frequency of 3.8%, there was an actual frequency of 11.2%.
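
The comparison in the preceding paragraph is a ratio of observed to expected frequency. A minimal sketch in Python, using only the I-section figures quoted in the text; the function name `concentration_ratio` is an illustrative choice and does not come from the report:

```python
def concentration_ratio(observed_pct, expected_pct):
    """Return how many times larger the observed frequency is than the expected one."""
    if expected_pct <= 0:
        raise ValueError("expected frequency must be positive")
    return observed_pct / expected_pct


# I's within the I section of EJC: 11.2% observed against 3.8% expected (Ohlman).
i_ratio = concentration_ratio(11.2, 3.8)
print(f"I section: {i_ratio:.1f} times expected")
```

For the I section this comes to roughly three times the expected frequency, a smaller excess than the "about nine times" reported for the W section.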

It  was  known  that  the  EJC  list  was  generated  and  consolidated 
from  other  lists  on  a  somewhat  theoretical  basis.  This  was  done  in  part  by 
consultation with subject specialists in order to determine the technical
relatedness of terms. The DDC thesaurus, however, was compiled on a
document basis. Major considerations were that the related terms represent
actual literature in the collection and that, in the opinion of the indexers, they are
useful, technically correct, and solve a specific indexing problem. They
may  be  incorporated,  for  example,  where  a  word  that  is  not  a  synonym  is 
used  in  lieu  of  another  word. 

As  a  matter  of  speculation,  we  considered  the  possibility  that 
the EJC had a large letter within letter representation because subject
specialists, when presented with a word beginning with a particular letter,
would naturally relate it to another word beginning with the same letter.
Thus, given the word "weight," the specialist might tend to relate it to
"weightlessness," but not to a word such as "gravity." Similarly, given
the word "water-tube boilers," he might tend to relate it to "water pollution,"
"water quality," and the like.

This  is  opposite  to  what  is  desired.  Related  term  references  are 
designed  to  refer  the  user  and  indexer  to  descriptors  he  would  not  think  of 
for himself. If thesauri refer the reference librarian or scientist to
descriptors he would think of himself, or to descriptors in close physical
proximity, which he would normally scan, it may well be that they are not serving
their  purpose.  Their  purpose  is  to  aid  the  user  by  providing  alternatives  and 
a  means  of  association  to  other  documents  in  the  collection. 

To  spot  check  whether  the  EJC  thesaurus  was  unusual  in  its  heavy 
concentration of first letters, we counted the related terms beginning
with the letter W within the W section of the DDC thesaurus. We found that
the DDC thesaurus more closely followed the Ohlman pattern, with 5% of related
terms beginning with W, as against the 2.3% average frequency in Ohlman.

III. DESCRIPTOR FIRST LETTER FREQUENCY.


At  this  point,  it  was  decided  to  go  into  a  thorough  study  of  the 
problem.  The  possibility  was  considered  that  in  addition  to  deviations 
because of word repetition, as in the case of "weight" and "weightlessness,"
there  might  well  be  a  factor  for  letter  repetition.  We,  therefore,  committed 
ourselves  to  study  letter  repetition  first. 

As  a  first  step,  we  wished  to  know  how  closely  the  descriptors 
currently  used  in  the  technical  thesauri  themselves  followed  the  Ohlman 
pattern.  The  answer  should  be  of  some  concern  in  superimposed  coding 
and  perhaps  to  programming  economy.  4  However,  our  ultimate  aim  was 
comparison  with  the  first  letter  frequencies  of  the  related  terms. 

The  thesauri  selected  were:  the  Engineers  Joint  Council 
Thesaurus  of  Engineering  Terms,  the  Defense  Documentation  Center's  (DDC) 
Thesaurus of ASTIA Descriptors, the Medical Subject Headings (MESH) of
the National Library of Medicine, and the Medical and Health Related Sciences
Thesaurus (MHRST) of the Public Health Service. The ASTIA Thesaurus
Code Manual was used for additional validation of the Thesaurus of ASTIA
Descriptors.  The  study  consisted  of  counting  all  first  letter  frequencies  for 
descriptors  beginning  with  each  of  10  letters  of  the  alphabet.  Table  3  displays 
the  results  for  the  various  thesauri. 

An  inspection  of  this  table  indicates  a  similarity  in  the  frequencies 
of  the  given  letters  among  the  various  thesauri.  An  analysis  of  variance  test 
confirmed  this  and  showed  no  significant  difference  between  the  thesauri  in 
letter  frequencies. 

Not  only  were  they  similar  when  compared  with  each  other,  but 
very little difference was found when they were compared to Ohlman's list.
Table 4 compares the results of table 3 with Ohlman's list. With respect to
the overall range, DDC varies the least, with a range of -0.3% to +0.7%, a
spread of 1.0%. The EJC had a range of -1.7% to +2.3%. The MESH and
the MHRST differed the most with respect to total. Their ranges were
5.4% and 6.7%, respectively.

Table  5  compares  the  frequency  rank  of  the  various  letters  for 
Ohlman  and  the  four  thesauri.  The  ranks  for  the  EJC  and  the  DDC  thesauri 
are  identical  with  that  of  Ohlman.  The  MESH  and  the  MHRST  differ  slightly, 
but  on  the  whole  they  are  still  similar.  Thus,  letters  A,  S,  and  M  are  still  the 
most  frequently  used  letters,  while  W,  K,  J,  and  Z  are  still  the  least  used. 

RANK  ORDER  FOR  INITIAL  LETTER  FREQUENCIES 


It  was  thought  that  possibly  the  displacement  of  S  by  A  and  W  by  K  might  be 
caused  by  a  "sag"  at  the  terminal  end  of  the  alphabet.  It  was  found,  however, 
that  this  was  not  true.  Several  letters  at  the  end  of  the  alphabet  for  MHRST 
have  a  higher  frequency  than  found  on  the  Ohlman  list.  Results  for  these 
letters were T - 7.49%, U - 1.57%, V - 4.61%, W - 0.8%, X - 0.3%, and
Y  -  0.1%. 


In  general,  the  results  confirm  that  the  rank  distribution  for  the 
descriptors of the four thesauri is very similar to those of the tools
investigated by Ohlman. For the DDC and EJC thesauri, one could use his average
results  as  they  stand  in  superimposed  coding  in  lieu  of  the  results  for  the 
thesauri  with  no  loss.  A  slight  modification  would  be  necessary  for  MESH 
and  MHRST. 

Thus,  in  acquiring  background  for  the  study  of  related  term 
references,  we  found  that  for  these  thesauri,  regardless  of  other  differences, 
the  descriptors  themselves  follow  a  predictable  pattern  reported  in  previous 
work.  We  also  found  a  basis  for  comparison  of  the  frequency  of  related 
term  references  and  the  descriptor  list  proper. 

IV.  RELATED  TERM  FIRST  LETTER  FREQUENCY. 


It  was  now  possible  to  determine  whether  a  sample  of  related 
term references has a high concentration of letters within letters when
compared with both the various thesauri and Ohlman. We were also in a position
to  find  out  whether  EJC  was  unusual  in  this  regard  with  respect  to  the  other 
thesauri.  Ten  letters  were  selected  as  representing  infrequent,  frequent, 
and  intermediate  frequency  first  letters  based  on  the  Ohlman  distribution. 

For each letter, frequencies were counted for occurrence within its own
section; that is, we counted the related term W's within the W descriptors, the
A  related  terms  within  the  A  section,  and  so  on.  All  related  terms  were 
counted  for  each  letter  and  within  each  thesaurus.  The  results  are  presented 
in  table  6. 
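
The counting procedure described above can be sketched as follows. This is a minimal illustration, assuming a thesaurus is held as a mapping from each descriptor to its list of related terms; that layout, the function name, and the toy entries are assumptions made for the example and are not data from any of the four thesauri.

```python
from collections import defaultdict


def within_section_frequency(thesaurus, letters):
    """For each letter section, return the percentage of related terms
    that begin with the same letter as the section itself."""
    totals = defaultdict(int)  # all related terms seen in a section
    same = defaultdict(int)    # related terms repeating the section letter
    for descriptor, related_terms in thesaurus.items():
        section = descriptor[:1].upper()
        if section not in letters:
            continue
        for term in related_terms:
            totals[section] += 1
            if term[:1].upper() == section:
                same[section] += 1
    return {s: 100.0 * same[s] / totals[s] for s in totals}


# Hypothetical toy entries, for illustration only.
toy = {
    "Weight": ["Weightlessness", "Gravity", "Mass", "Density"],
    "Water supply": ["Water resources", "Water tanks", "Reservoirs", "Pumps"],
    "Alloys": ["Aluminum", "Metals"],
}
result = within_section_frequency(toy, letters={"W", "A"})
print(result)
```

Each percentage is taken over all related terms in a section, so descriptors with long related-term lists weigh more heavily than those with short ones.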


The results confirm that the EJC thesaurus contained a large
concentration of letters within letters. Within the "S" section, one out of four
related terms begins with the letter S. Even for the letter with the lowest
concentration, J, one out of ten related terms begins with that letter. The average
concentration  of  individual  letters  turned  out  to  be  19.0%  (table  6),  while  the 
expected  concentration  is  4.6%  (table  3). 

It appeared then that the first letter frequency of related term
references in the EJC thesaurus differed from the first letter frequency pattern
of its own descriptors, of descriptors in other thesauri, and of other tools.



Table  6  was  somewhat  surprising  in  that  while  it  showed  EJC  to 
be  higher  than  expected  in  related  term  first  letter  frequency  it  also  showed 
that  the  other  thesauri  are  similarly  high.  The  EJC  thesaurus  contained  a 
concentration  of  letters  within  letters  that  was  almost  five  times  what  would 
be  expected.  The  average  for  all  thesauri  was  4-1/2  times  expected. 

While  this  result  is  biased  by  the  large  number  of  observations 
for  EJC,  the  overall  heavy  concentration  of  letters  within  letters  can  be  seen 
by  considering  each  thesaurus  separately  (table  7).  For  the  DDC  thesaurus, 
the  results  are  4-1/2  times  expected.  The  MESH  results  were  a  little  over 
2-1/2 times expected. For the MHRST, overall results were a little less
than  2-1/2  times  expected. 

In  this  table,  there  are  a  number  of  outlying  observations  among 
the J's, K's, and Z's. These usually occur where the number of observations
is small. As the sample size becomes larger (for example, the A's, S's,
and  M's),  the  results  become  more  consistent  and  the  results  group  more 
closely  around  the  average. 

The limited number of observations for the MESH makes it
impossible to draw general conclusions for this list. The unusual definition of the
"see" references in the MHRST poses another problem. Under the definition,
synonyms and related terms are not distinguished. There is no clear-cut
way of separating them; however, from a spot check of the synonyms in the
other thesauri, it was felt that these followed the expected pattern. Further,
the low ratio of synonyms to related terms in other thesauri would seem to
indicate that the contribution would be small. The effect of subtracting out
the synonym references in order to determine the related terms would be
small and would probably be to bring the MHRST closer to the DDC and EJC
thesauri. EJC could also be brought closer to the others by considering the
broader and narrower terms to be related terms or see also's (i.e.,
"permissive" references).

Adding interest to this result (the higher than expected first
letter  frequency)  is  the  fact  that  the  thesauri  differ  widely  in  their  method 
of  compilation,  the  subject  field  covered,  their  structural  characteristics, 
the  background  of  persons  compiling  them,  and  in  other  ways.  The  EJC 
thesaurus  was  compiled  mainly  by  engineers  and  scientists;  the  MESH  was 
compiled  by  professional  indexers.  The  EJC  thesaurus  was  compiled  more 
on the basis of submitted lists and from theoretical considerations; the DDC,
on a day-to-day basis from the actual document. The DDC thesaurus is
designed  for  the  government  report  literature;  the  MESH  covers  periodical 
articles  and  books.  There  are  many  other  differences. 

V. WORD REDUNDANCY.


We  now  wished  to  know  the  extent  to  which  letter  redundancy 
is  a  representation  of  word  redundancy.  How  much  does  word  repetition 
contribute to letter repetition within the related term references? If we
eliminate  the  effect  of  word  repetition,  would  the  first  letter  frequency 
pattern  of  the  residue  now  follow  the  Ohlman  descriptors  in  rank  and 
quantity? 


In  order  to  determine  the  above,  we  counted  all  related  terms 
where  initial  words  repeated  the  words  of  their  own  descriptors.  Thus, 
the  related  terms  "water  resources"  and  "water  tanks"  were  considered 
repetitious  when  included  under  "water  supply." 
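
The word-repetition test just described can be sketched as below. The matching rule is an assumption: a related term counts as repetitious only when its first word is identical to the first word of its descriptor, with hyphens treated as word breaks. How the report treated cases such as "weightlessness" under "weight" is not stated here, and this sketch does not count them. The term "reservoirs" is a hypothetical non-repeating example.

```python
def first_word(term):
    """Return the lower-cased first word of a term, treating hyphens as spaces."""
    words = term.lower().replace("-", " ").split()
    return words[0] if words else ""


def repeats_descriptor_word(descriptor, related_term):
    """True when the related term opens with the same word as its descriptor."""
    word = first_word(related_term)
    return bool(word) and word == first_word(descriptor)


related = ["water resources", "water tanks", "reservoirs"]
flags = [repeats_descriptor_word("water supply", term) for term in related]
print(flags)  # [True, True, False]
```

Counting the `True` results within a letter section gives the share of first-letter repetition that is attributable to word repetition.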

The  results  are  presented  in  table  8.  The  overall  contribution 
of word repetition to letter repetition is one-half. The extent of word
repetition is about the same for EJC and ASTIA. The MHRST is considerably
lower in this respect.

Table  9  shows  what  happens  when  we  eliminate  word  repetition 
from  letter  repetition.  The  results  for  each  thesaurus  are  still  considerably 
higher  than  Ohlman.  The  residue  of  9.4%  is  twice  that  of  the  expected 
distribution.  In  other  words,  subtraction  of  word  repetition  brings  first 
letter  repetition  to  a  point  about  halfway  to  Ohlman.  This  leads  us  to  the 
conclusion that there is a factor producing a higher initial letter
repetition that cannot be accounted for by word repetition.

VI.  UNKNOWN  FACTOR. 

In table 10, we rank the letters appearing in table 9. We find
that the rank order returns to a closer replica of the usual pattern (table 5).
Therefore, we can say that an unknown factor has resulted in an excess of
repeated first letters. This factor results in a small disruption of the
rank order of first letter frequencies (table 11, rank order of related term
first letters without word repetition). There is an additional factor
resulting from word repetition that results in a greater disruption of the
"normal pattern."

The summary (table 12) shows the contribution of each of these
factors to the overall deviation from the descriptors of each thesaurus.
The EJC is heaviest in the unknown factor. The factor is evenly distributed
among the 10 letters in EJC; that is, all letters are consistently higher than
expected after subtraction of word repetition. In the other thesauri, they
are not.



RANK  OF  RELATED  TERMS  INITIAL  LETTER  FREQUENCIES 
(Letters  within  their  own  section) 



TABLE  12 


This  factor  may  be  due  to  a  natural  technical  relatedness,  or 
to  a  psychological  factor.  We  would  want  to  know  more  about  this  before 
making  value  judgements  about  the  desirability  of  word  redundancy  and 
letter  redundancy. 

We  did  one  additional  piece  of  work  on  this  initial  phase.  We 
had  tentatively  accepted  that  the  I,  J,  K,  and  L's  appeared  short  within  the 
W section because of the surplus of W's. We demonstrated that there was
a large number of W's when compared with usual frequency patterns. This
does not definitely prove that the surplus is the reason for the apparent IJKL
shortage. In order to confirm our suspicions, we wished to determine
whether these and the other letters in the W section would display the
normal pattern when the W's were eliminated.

We  did  this  by  counting  all  related  term  references  for  all  26 
letters  of  the  alphabet  in  the  W  section.  With  the  exception  of  the  W's,  all 
letters were consistently lower in frequency than Ohlman and EJC
descriptors. With the exception of the W's, the 10 letters follow the general rank
order of Ohlman and EJC descriptors. The order within the W section is
W, S, A, M, E, L, I, K, J, and Z. For the 26 letters of the alphabet, the
rank correlation between the frequencies of the W section and Ohlman
was 0.85. If the W's and T's are eliminated, the correlation is 0.93. The
quantitative  effect  of  the  W's  is,  of  course,  greater  than  shown  by  rank 
correlation  since  their  occurrence  within  the  W's  is  20%  of  the  total. 
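
A rank correlation of this kind can be computed as sketched below. The text does not say which coefficient was used; Spearman's coefficient is assumed here, computed as the Pearson correlation of the ranks so that tied frequencies are handled. The two short frequency lists are hypothetical and are not the report's data.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest, averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (vx * vy) ** 0.5


# Hypothetical percentages for five letters, for illustration only.
section_pct = [20.0, 9.0, 8.0, 5.0, 1.0]
reference_pct = [2.3, 11.0, 9.0, 6.0, 0.5]
rho = spearman(section_pct, reference_pct)
print(round(rho, 2))
```

Because the coefficient depends only on rank order, a single letter with a very large frequency shifts it no more than any other misplaced rank, which is why the quantitative effect of the W's is understated by the correlation alone.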

VII.  CONCLUSIONS. 

This investigation was initiated in order to determine the reason
for an apparent shortage of I's, J's, K's, and L's in the related term references of
the EJC thesaurus. We decided to expand it into a study of the statistical
characteristics of first letters of related terms in four thesauri.

In  the  course  of  the  investigation,  we  found  that,  in  general, 
the distribution of descriptors in four thesauri followed the pattern
previously observed by Ohlman.

The  initial  letter  frequency  for  related  terms  within  given 
letters  did  not  follow  this  pattern.  The  major  reason  seemed  to  be  letter 
within letter redundancy. Related-term first letters that repeat the first letter
of their own descriptors occur over four times as often as expected when compared
against Ohlman and the descriptors themselves.

For the EJC thesaurus, we found a wide difference between the
frequencies of the first letters of the related terms within given sections and
the frequencies of the descriptors. The EJC thesaurus was not unusual in
this  characteristic  when  compared  to  the  other  thesauri.  It  was,  however, 
more  repetitious  of  the  same  first  letters  than  the  others. 

We  began  work  on  a  study  of  the  extent  to  which  this  letter 
redundancy  is  a  result  of  word  redundancy.  We  found  that  word  repetition 
only accounted for one-half of the "excess" of first letters. There are,
then,  factors  additional  to  word  repetition  contributing  to  the  "excess." 

The  EJC  and  ASTIA  thesauri  were  heaviest  in  word  repetition. 
The  EJC  thesaurus  was  heaviest  in  letter  repetition  and  in  the  additional 
factor  contributing  to  the  excess  of  letters  within  letters. 

Related term references are a principal aid in retrieving documents and information.
They provide the user and indexer with alternatives and suggestions and are a
means of association between items in the collection. In order to better
understand their nature and the differences in practice in their incorporation, we
studied the statistical characteristics of related term references in four thesauri.
Our method was to compare the first letter frequencies in these thesauri with each
other, with the first letter frequencies of their descriptors, and with the pattern
noted by other investigators. The thesauri selected were the Engineers Joint
Council Thesaurus of Engineering Terms (EJC), the Defense Documentation Center's
Thesaurus of ASTIA Descriptors (ASTIA), the Medical Subject Headings of the
National Library of Medicine (MESH), and the Medical and Health Related Sciences
Thesaurus of the Public Health Service (MHRST). We found that the related terms
did not follow the first letter frequency pattern of their own descriptors or
that reported in the literature. The principal difference was in redundancy of
letters within their own letter section. The thesauri were fairly consistent in
this difference. In addition to word redundancy, there seemed to be an additional
factor resulting in the redundancy.